
DATA

The dataset contains transactions made by credit cards in September 2013 by European cardholders. It presents transactions that occurred over two days, where we have 492 frauds out of 284,807 transactions. The dataset is highly unbalanced: the positive class (frauds) accounts for 0.172% of all transactions.

It contains only numerical input variables which are the result of a PCA transformation. Unfortunately, due to confidentiality issues, we cannot provide the original features and more background information about the data. Features V1, V2, … V28 are the principal components obtained with PCA; the only features which have not been transformed with PCA are 'Time' and 'Amount'. Feature 'Time' contains the seconds elapsed between each transaction and the first transaction in the dataset. The feature 'Amount' is the transaction amount; this feature can be used for example-dependent cost-sensitive learning. Feature 'Class' is the response variable and it takes value 1 in case of fraud and 0 otherwise.
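The class imbalance is easy to verify with pandas. The real file path ("creditcard.csv") is an assumption; a tiny stand-in Series below illustrates the same call on data that runs anywhere:

```python
import pandas as pd

# With the actual Kaggle file this would be:
# df = pd.read_csv("creditcard.csv"); counts = df["Class"].value_counts()
# Here a small synthetic stand-in demonstrates the same check.
counts = pd.Series([0] * 9995 + [1] * 5, name="Class").value_counts()
print(counts)                                      # class 0 dominates
print(f"fraud share: {counts[1] / counts.sum():.3%}")
```

On the real dataset the same computation gives 492 / 284,807 ≈ 0.172%.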

INTRODUCTION

Credit card fraud happens when consumers give their credit card number to unfamiliar individuals, when cards are lost or stolen, when mail is diverted from the intended recipient and taken by criminals, or when employees of a business copy the cards or card numbers of a cardholder.

In recent years credit card usage has become predominant in modern society, and credit card fraud keeps growing. Financial losses due to fraud affect not only merchants and banks (e.g. through reimbursements) but also individual clients. If the bank loses money, customers eventually pay as well through higher interest rates, higher membership fees, etc. Fraud may also damage the reputation and image of a merchant, causing non-financial losses that, though difficult to quantify in the short term, may become visible in the long run.

A Fraud Detection System (FDS) should not only detect fraud cases efficiently, but also be cost-effective in the sense that the cost invested in transaction screening should not be higher than the loss due to fraud. The predictive model scores each transaction with high or low risk of fraud, and those with high risk generate alerts. Investigators check these alerts and provide feedback for each alert, i.e. true positive (fraud) or false positive (genuine).

Banks process huge numbers of transactions, among which very few are fraudulent, often less than 0.1%. Also, only a limited number of transactions can be checked by fraud investigators; we cannot ask a human to check every transaction one by one to decide whether it is fraudulent or not.

Alternatively, with Machine Learning (ML) techniques we can efficiently discover fraudulent patterns and predict which transactions are likely to be fraudulent. ML techniques consist of inferring a prediction model from a set of examples. The model is in most cases a parametric function that predicts the likelihood of a transaction being fraudulent, given a set of features describing the transaction.

Methodology

Fraud detection is a binary classification task in which any transaction is predicted and labeled as either fraudulent or legitimate. In this notebook, state-of-the-art classification techniques were tried for this task and their performances were compared.

EDA (Exploratory Data Analysis)

There are no null values. The data set contains 284,807 transactions. The mean value of a transaction is 88.35 USD, while the largest transaction recorded in this data set amounts to 25,691 USD. However, as you might be guessing based on the mean and maximum, the distribution of the monetary value of all transactions is heavily right-skewed. The vast majority of transactions are relatively small, and only a tiny fraction comes even close to the maximum. As mentioned above, I cannot say much about the other variables, because the dataset has already been transformed with PCA due to privacy concerns.

Let's take a closer look at the variables we selected. First, we see many more outliers in the Not Fraud transactions than in the Fraud transactions. This is interesting, but there are only 492 fraud transactions, so it might simply be a consequence of the sample sizes.

I cannot see strong correlations between the variables. The reason is probably the PCA transformation already applied to the data: principal components are orthogonal by construction, so near-zero correlations between V1–V28 are expected.

Feature Selection

Outlier detection is a complex topic. The trade-off between reducing the number of transactions (and thus the volume of information available to the algorithms) and letting extreme outliers skew the predictions is not easily solvable and depends highly on your data and goals. In my case, I decided to focus exclusively on ML methods and will not pursue this topic.

Create Function to Find Best Algorithm

Standard ML techniques such as Decision Tree and Logistic Regression have a bias towards the majority class, and they tend to ignore the minority class. They tend to predict only the majority class, hence badly misclassifying the minority class. In more technical words, if we have an imbalanced class distribution in our dataset, then our model becomes prone to the case where the minority class has negligible or very low recall.
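The bias described above is easy to demonstrate on synthetic 99:1 data: a classifier that always predicts the majority class already scores ~99% accuracy while missing every single minority example. (This is a sketch with a deliberately trivial model, not the notebook's actual setup.)

```python
from sklearn.datasets import make_classification
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score, recall_score

# 99% majority class, no label noise
X, y = make_classification(n_samples=10_000, weights=[0.99], flip_y=0,
                           random_state=42)
clf = DummyClassifier(strategy="most_frequent").fit(X, y)
pred = clf.predict(X)
print(accuracy_score(y, pred))   # ~0.99 -- looks great, is useless
print(recall_score(y, pred))     # 0.0 -- every minority example is missed
```

This is exactly why accuracy alone is misleading here and why resampling and recall-oriented metrics are needed.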

There are two main families of algorithms that are widely used for handling imbalanced class distributions: oversampling and undersampling.

SMOTE (Synthetic Minority Oversampling Technique) – Oversampling

SMOTE is one of the most commonly used oversampling methods to solve the imbalance problem. It generates virtual training records by linear interpolation for the minority class. These synthetic training records are generated by randomly selecting one or more of the k nearest neighbors of each example in the minority class. After the oversampling process, the data is reconstructed and several classification models can be applied to the processed data.
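The interpolation step can be sketched in a few lines of NumPy; this is a toy illustration of the idea, not the library implementation (in practice one would use imbalanced-learn's `SMOTE`):

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

def smote_sample(X_min, n_new, k=5, rng=np.random.default_rng(0)):
    """Toy SMOTE: new point = x + u * (neighbor - x), with u ~ Uniform(0, 1)."""
    nn = NearestNeighbors(n_neighbors=k + 1).fit(X_min)  # +1: self is a neighbor
    _, idx = nn.kneighbors(X_min)
    new = []
    for _ in range(n_new):
        i = rng.integers(len(X_min))          # pick a random minority example
        j = rng.choice(idx[i][1:])            # pick one of its k neighbors
        u = rng.random()
        new.append(X_min[i] + u * (X_min[j] - X_min[i]))
    return np.array(new)

X_min = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0],
                  [0.5, 0.5], [0.2, 0.8]])
print(smote_sample(X_min, n_new=3, k=3).shape)   # (3, 2)
```

Because each synthetic point lies on a segment between two real minority points, SMOTE never generates samples outside the convex hull of the minority class.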

NearMiss Algorithm – Undersampling

NearMiss is an under-sampling technique. It aims to balance the class distribution by eliminating majority-class examples, selected not at random but by their distance to minority-class examples. When instances of the two classes are very close to each other, instances of the majority class are removed to increase the space between the two classes, which helps the classification process. Such near-neighbor criteria are widely used to limit the information loss that affects most under-sampling techniques.
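The selection rule of NearMiss-1 (keep the majority points whose mean distance to their k nearest minority points is smallest) can be sketched as a toy version; imbalanced-learn's `NearMiss` is what one would use in practice:

```python
import numpy as np
from sklearn.metrics import pairwise_distances

def nearmiss1(X_maj, X_min, n_keep, k=3):
    """Toy NearMiss-1: keep majority points closest (on average) to their
    k nearest minority points."""
    d = pairwise_distances(X_maj, X_min)   # shape (n_maj, n_min)
    d.sort(axis=1)                         # nearest minority points first
    score = d[:, :k].mean(axis=1)
    keep = np.argsort(score)[:n_keep]      # smallest mean distance wins
    return X_maj[keep]

rng = np.random.default_rng(0)
X_maj = rng.normal(0.0, 1.0, size=(50, 2))   # majority cluster near (0, 0)
X_min = rng.normal(3.0, 0.5, size=(5, 2))    # minority cluster near (3, 3)
print(nearmiss1(X_maj, X_min, n_keep=5).shape)   # (5, 2)
```

Note that the retained majority points are precisely those nearest the class boundary, which is where the classifier needs resolution.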

Borderline-SMOTE

A popular extension to SMOTE involves selecting those instances of the minority class that are misclassified, such as with a k-nearest neighbor classification model. We can then oversample just those difficult instances, providing more resolution only where it may be required.

A related approach is summarized in the 2009 paper titled “Borderline Over-sampling For Imbalanced Data Classification.” There, an SVM is used to locate the decision boundary defined by the support vectors, and examples in the minority class that are close to the support vectors become the focus for generating synthetic examples.

Another approach involves generating synthetic samples inversely proportional to the density of the examples in the minority class. That is, generate more synthetic examples in regions of the feature space where the density of minority examples is low, and fewer or none where the density is high. This is the idea behind the ADASYN (Adaptive Synthetic Sampling) method.

I defined X and y by creating a Split() function. Then I created a function for each of the SMOTE and ADASYN variants. This way we can easily apply all of them to the classification algorithms I selected and learn which one works best.

Here we see the resampled class counts and the running times of our functions, which makes it easier to decide between them.


Confusion Matrix

A confusion matrix is a technique for summarizing the performance of a classification algorithm.

Classification accuracy alone can be misleading if you have an unequal number of observations in each class or if you have more than two classes in your dataset.

Calculating a confusion matrix can give you a better idea of what your classification model is getting right and what types of errors it is making.


Here I created a function to calculate and visualize the confusion matrix, so that each model/sampler combination can be compared at a glance.
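A minimal sketch of such a function using scikit-learn (the notebook's actual helper is not shown, so this is an assumed shape; plotting is left optional because it needs matplotlib):

```python
import numpy as np
from sklearn.metrics import ConfusionMatrixDisplay, confusion_matrix

def show_confusion(y_true, y_pred, title=""):
    """Compute the 2x2 matrix; rows are true classes, columns predictions."""
    cm = confusion_matrix(y_true, y_pred)
    # ConfusionMatrixDisplay(cm).plot()   # uncomment to visualize (matplotlib)
    return cm

y_true = np.array([0, 0, 0, 1, 1, 0])
y_pred = np.array([0, 1, 0, 1, 0, 0])
print(show_confusion(y_true, y_pred))
# Layout: [[TN FP],
#          [FN TP]]  ->  [[3 1],
#                         [1 1]]
```

With the fraud data, the off-diagonal cells are the interesting ones: FP means a genuine customer gets flagged, FN means a fraud slips through.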

Find Best Algorithm

No, it's not good.

No

title = 'KNeighbors Classifier/SMOTE'
%time Models(KNeighborsClassifier(), X_train1, X_test1, y_train1, y_test1, title)
title = 'KNeighbors Classifier/BSMOTE'
%time Models(KNeighborsClassifier(), X_train2, X_test2, y_train2, y_test2, title)
title = 'KNeighbors Classifier/SMOTESVM'
%time Models(KNeighborsClassifier(), X_train3, X_test3, y_train3, y_test3, title)
title = 'KNeighbors Classifier/ADASYN'
%time Models(KNeighborsClassifier(), X_train4, X_test4, y_train4, y_test4, title)

Definitely no

I think we found some candidates: 'Random Forest Classifier/SMOTE' and 'Random Forest Classifier/ADASYN' might be the ones.

Oh, no

'XGB Classifier/ADASYN' and 'XGB Classifier/SMOTE' can be okay.

I think, no

No.

Looking at the models, I chose 4 algorithms.

Next, I build a new function that produces the ROC curve, the precision-recall curve, results at different threshold values, and the classification report tables, and makes them easier to read with visualization.

ROC Curve & Precision Recall Curve

ROC Curves summarize the trade-off between the true positive rate and false positive rate for a predictive model using different probability thresholds.

Precision-Recall curves summarize the trade-off between the true positive rate and the positive predictive value for a predictive model using different probability thresholds.

ROC curves are appropriate when the observations are balanced between each class, whereas precision-recall curves are appropriate for imbalanced datasets.
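Both curves can be computed directly from scikit-learn on toy scores standing in for model probabilities (in the notebook this would be `clf.predict_proba(X_test)[:, 1]`):

```python
import numpy as np
from sklearn.metrics import (average_precision_score, precision_recall_curve,
                             roc_auc_score, roc_curve)

y_true = np.array([0, 0, 0, 0, 1, 1])            # ground truth
scores = np.array([0.1, 0.2, 0.3, 0.8, 0.7, 0.9])  # predicted probabilities

fpr, tpr, _ = roc_curve(y_true, scores)           # points of the ROC curve
prec, rec, _ = precision_recall_curve(y_true, scores)
print("ROC AUC:", roc_auc_score(y_true, scores))             # 0.875
print("Average precision:", average_precision_score(y_true, scores))
```

The single-number summaries (ROC AUC and average precision) are what make comparing many model/sampler combinations tractable; average precision is the more informative of the two on data this imbalanced.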

Classification Report

It is one of the performance evaluation tools for a classification-based machine learning model. It displays the model's precision, recall, F1 score and support, providing a better understanding of the overall performance of our trained model. To read a classification report, you need to know the metrics it displays, so they are explained below.

Precision: Precision is defined as the ratio of true positives to the sum of true and false positives.

Recall: Recall is defined as the ratio of true positives to the sum of true positives and false negatives.

F1 Score: The F1 is the harmonic mean of precision and recall. The closer the value of the F1 score is to 1.0, the better the expected performance of the model.

Support: Support is the number of actual occurrences of the class in the dataset. It doesn't vary between models; it just puts the other metrics in context.
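The four metrics above can be checked on a small hand-worked example (TP = 3, FP = 1, FN = 1, so precision = recall = 3/4):

```python
from sklearn.metrics import (classification_report, f1_score,
                             precision_score, recall_score)

y_true = [0, 0, 0, 0, 1, 1, 1, 1]
y_pred = [0, 0, 0, 1, 1, 1, 1, 0]

# precision = TP / (TP + FP), recall = TP / (TP + FN),
# F1 = 2 * precision * recall / (precision + recall)
print(precision_score(y_true, y_pred))   # 3 / 4 = 0.75
print(recall_score(y_true, y_pred))      # 3 / 4 = 0.75
print(f1_score(y_true, y_pred))          # 0.75
print(classification_report(y_true, y_pred))   # support: 4 per class
```

Note that precision and recall only coincide here because FP = FN; in the fraud setting they usually diverge, which is exactly what the report exposes.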

Conclusion

I think the Random Forest Classifier/ADASYN method is better: looking at the charts, the results are better on the Random Forest Classifier/ADASYN plots. But keep in mind that I did not use feature selection methods, and since the data came already PCA-transformed, spotting outliers or noisy data is very hard. I did not want to make incorrect predictions. The data was also imbalanced to begin with.

There are 35 false positives with the Random Forest Classifier/ADASYN method. This means roughly 35 wrong predictions per 284,772 transactions. At first glance that looks good, but banks process millions of transactions from customers every day. This means the banks might lock hundreds of customer accounts unnecessarily, which would reduce confidence in the bank.

This approach can be developed further with different methods, such as implementing feature selection or cross-validation, or reviewing the data again with neural networks or genetic algorithms.